Abstract

Increasing popularity of mobile route planning applications based on GPS technology
provides opportunities for collecting traffic data in urban environments. One
of the main challenges for travel time estimation and prediction in such a setting is 
how to aggregate data from vehicles that have followed different routes, and predict 
travel time for other routes of interest. One approach is to predict travel times for 
route segments, and sum those estimates to obtain a prediction for the whole route. 
We study how to obtain optimal predictions in this scenario. It appears that the 
optimal estimate, minimizing the expected mean absolute error, is a combination 
of the mean and the median travel times on each segment, where the combination 
function depends on the number of segments in the route of interest. We present a 
methodology for obtaining such predictions, and demonstrate its effectiveness with 
a case study using travel time data from a district of St. Petersburg collected over 
one year. The proposed methodology can be applied for real-time prediction of 
expected travel times in an urban road network. 


1 Introduction 

Traffic congestion is very common in modern urban environments. The higher than ever
penetration of mobile sensing technologies allows collecting real-time traffic data from
many users at relatively low cost. Such data can be used for providing real-time
information about traffic conditions. In addition, data accumulated over time can be
used for modeling traffic patterns and making predictions about traffic in the near
future, which can help individuals plan their travel and contribute towards mitigating
traffic congestion.

One of the main challenges in urban travel time prediction is that vehicles follow
different routes, and traffic conditions change rapidly. Suppose we are interested
in predicting the current travel time for a given route. Probably very few or no drivers have
recently traveled exactly the same route, so no direct data is available for making predictions.
However, it is very likely that a number of vehicles have passed through different
segments on the route of interest, and we have data on recent travel times on separate
segments. We study how to optimally combine predictions made on individual segments
into a prediction for the whole route of interest. It turns out that a simple sum of predictions
for individual segments is not optimal due to the nature of the travel time distribution.
We propose a methodology for estimating travel time on individual segments and for
combining these estimates in an optimal way that minimizes the mean absolute
error of travel time prediction for any route in the road network. We demonstrate the
effectiveness of the proposed methodology with a case study using travel time data from
a district of St. Petersburg collected over one year.

Several commercial services for predicting travel time exist, such as Yandex.Traffic¹;
however, we are not aware of any publicly available research that systematically
investigates how to optimally aggregate predictions made for road segments.

The paper is organized as follows. Section 2 presents background and related work.
In Section 3 we discuss alternative optimization criteria for travel time prediction, and
theoretically analyze expected prediction accuracies. Section 4 presents our methodology
for obtaining optimal travel time estimates. An experimental case study is presented in
Section 5. Section 6 concludes the study.


2 Background 

This section overviews major research directions in travel time prediction from empirical 
data. 


2.1 Possible data sources 

Two types of traffic data sources can be distinguished: static and dynamic. Static 
data comes from sensors fixed on roads, such as, inductive loop detectors, or cameras 
for license plate recognition. Several stationary sensors, installed along a road, can 
estimate how long it takes for a vehicle to travel from one sensor to another. 

Dynamic data comes from sensors (typically GPS) installed in cars. Data can be 
collected by designated probe vehicles driving for the purpose of data collection, service 
fleet, or private vehicles driving on their own business (a.k.a. floating car), equipped 
with GPS receivers. 

Stationary data collection provides a complete view of traffic, since it counts all the vehicles,
but it is relatively expensive to deploy, and it is not suitable for tracking vehicles in
urban environments, where there are lots of small streets and possible turns. A large
number of stationary sensors would be necessary for following vehicles. On the other
hand, dynamic data collection using GPS tracking can follow the exact movement of a
vehicle no matter how many possible turns there are, but only cars that have the necessary
equipment can be tracked. Dynamic data therefore captures only a sample of the traffic, not all of it.

We study travel time prediction in urban environments; hence, we focus on dynamically
collected data and methods for working with such data.


2.2 Related work 

Travel time prediction from empirical trip data has been studied for over a decade. Table
1 presents a summary of representative work in this area.

Studies differ in data sources, traffic environment, predictive models and evaluation
measures. On highways, data is collected via induction loop detectors Kwon et al. (2000);
Ishak and Al-Deek (2002); Rice and van Zwet (2004); Wu et al. (2004); Bajwa et al. (2005);
Innamaa (2005); Guin (2006); Fei et al. (2011), cameras Bajwa et al. (2005); Innamaa (2005);
Guin (2006), radio-frequency (RF) identification tags Chien and Kuchipudi (2003), or toll
stations Heilmann et al. (2011). In the highway settings forming the prediction target is
straightforward, because most of the vehicles follow the same route, and plenty of historical data
is available for modeling from the route in question. In these settings there is no need
for aggregated predictions.


1 http://maps.yandex.com/traffic 
Table 1: Summary of related work.

Study                       Road           Data source              Predictive models               Evaluation measures
Kwon et al. (2000)          highway        loop detectors           stepwise regression, tree, ANN  MSPE
Ishak and Al-Deek (2002)    highway        loop detectors           GLM                             MAPE
Chien and Kuchipudi (2003)  highway        RFID probe vehicles      Kalman filter                   MAPE, RMSEP
Rice and van Zwet (2004)    highway        loop detectors           linear regression, kNN          RMSE
Wu et al. (2004)            highway        loop detectors           SVM regression                  MAPE, RMSEP
Bajwa et al. (2005)         highway        loop detectors, cameras  pattern matching                correlation, RMSE, hit ratio
Innamaa (2005)              highway        loop detectors, cameras  ANN                             MAE, RMSE, ME, MRE, hit ratio
Guin (2006)                 highway        cameras                  ARIMA                           MAE, MAPE, RMSEP
Fei et al. (2011)           highway        loop detectors           Bayesian                        MAE, MAPE, RMSE
Heilmann et al. (2011)      highway        local detector, toll     kernel predictor                RMSE
de Fabritiis et al. (2008)  city           GPS private cars         pattern matching and ANN        MAPE, RMSE
Vanajakshi et al. (2009)    city           GPS buses, probe cars    Kalman filter                   MAPE
Markovic et al. (2010)      city           GPS courier vehicles     kNN and ARIMA                   MAPE, ME, RMSE
Westgate et al. (2013)      city           GPS ambulances           Bayesian model                  RMSE
Jones et al. (2013)         highway, city  GPS floating car         SVM regression                  MAPE

The most popular measures for travel time prediction accuracy are the mean absolute
percentage error (MAPE), which is a normalized version of the mean absolute error, and
the root mean square error (RMSE), or its normalized version RMSEP. Often in research
studies accuracy is reported using several alternative measures in order to provide a
more comprehensive view of the results. Accuracy is measured over an individual route or
segment. We are not aware of any research work investigating optimization criteria or
evaluation measures for travel time prediction in a road network.

The scenario considered by Fei et al. (2011) is to some extent related to
our problem setting. The authors aggregate predictions for 66 segments of one highway.
They use a simple sum of means as the combination rule. They do not investigate any
alternative rules, and the focus of their paper is not on optimal aggregation methods, as
is the focus of our paper. We will demonstrate that the mean rule is sub-optimal for
combining a small number of segments, but approaches the optimum when the number
of segments is large. Practically, 66 segments is already a large number; hence, the sum
of means may work reasonably well in this case.

Several studies model travel times in urban environments using GPS data de Fabritiis et al.
(2008); Vanajakshi et al. (2009); Markovic et al. (2010); Jones et al. (2013). Vanajakshi
et al. (2009) predict bus travel times over a test route. The setting is
similar to a highway setting, where all the vehicles follow the same route; thus, modeling
data from the same route is directly available, and there is no need for aggregated predictions.
The other three studies de Fabritiis et al. (2008); Markovic et al. (2010); Jones et al.
(2013) use floating car data, where vehicles can follow many different routes. However,
all three studies consider a simplified scenario, where predictions for individual segments
are made and evaluated individually, and there are no aggregated predictions for different
routes. In comparison, our study considers a more advanced prediction scenario, where
the goal is to optimize the prediction accuracy not over individual segments, but over a
set of possible routes. We will demonstrate that the optimization criteria in those two
scenarios are not the same.


A study on ambulance arrival times Westgate et al. (2013) uses a Bayesian model
for estimating travel times over the road network. The model parameters are learned all
at once for the whole network. Learning such models requires a lot of training samples,
which is not feasible in our case, where only a small fraction of all cars in the network
provide data, and data distribution is changing over time, which would require different
parametrisation at different times of day.

Finally, a different line of research develops traffic simulation models (see e.g. Treiber and
Kesting (2013)), which are mainly used for road planning, transportation logistics, car design
and manufacturing, but to the best of our knowledge, such models are not used for
real-time traffic predictions. One of the main limiting factors is that effective predictive
models would need to know in advance at least where each vehicle is heading, which is
practically infeasible.


2.3 Predicting travel time vs. predicting speed 

The majority of related studies aim at predicting travel time; only Heilmann et al. (2011)
consider speed prediction. While speed is easier for humans to interpret (we usually
think about traffic conditions in terms of speed, not travel time), time has an important
advantage as a target variable for prediction.

One of the main purposes of traffic prediction is to plan optimal routes for vehicles
driving in the city. Many criteria for route optimality can be considered, such as route
length, quantity of fuel used and complexity of driving directions, but the most common
by far is the total driving time. In a deterministic setting, travel time can always be
computed from speed, but the task becomes more complex in a stochastic setting, where
the expected time and the expected speed are not related by a strict dependence. A
model that is good at predicting expected speed may be misleading if used for predicting
travel time, as the following example illustrates.

Table 2: Example: comparing travel speeds vs. travel times.

                          Probability  Travel time  Speed
Route A  Fast traffic     1/2          12 min       60 km/h
         Slow traffic     1/2          24 min       30 km/h
         Expected values               18 min       45 km/h
Route B  Fast traffic     1/2          10 min       72 km/h
         Slow traffic     1/2          40 min       18 km/h
         Expected values               25 min       45 km/h

Suppose a driver can take one of two possible routes (A or B) of equal length, 12 km.
Due to traffic conditions (fast or slow traffic), two variants of travel time are possible
on each route, and they may happen with equal prior probability. Traffic conditions
on these routes are independent. All possible outcomes of the journey are indicated in
Table 2. We can see that the expected speeds on both routes are equal, but the expected
travel times differ. Hence, generally it is not possible to deduce expected travel time
given only expected speed. Therefore, travel time is chosen as the target variable given
the task to plan the fastest route.
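
The arithmetic behind this example can be checked with a short sketch. The variable and function names are ours and purely illustrative; the numbers are those of the example (two equally likely scenarios per route, route length 12 km).

```python
# Scenarios of the two-route example: (probability, travel time in minutes).
ROUTE_LENGTH_KM = 12.0
routes = {
    "A": [(0.5, 12.0), (0.5, 24.0)],
    "B": [(0.5, 10.0), (0.5, 40.0)],
}

def expected_time_min(scenarios):
    """Expected travel time in minutes."""
    return sum(p * t for p, t in scenarios)

def expected_speed_kmh(scenarios, length_km=ROUTE_LENGTH_KM):
    """Expected speed in km/h, averaging the speed of each scenario."""
    return sum(p * length_km * 60.0 / t for p, t in scenarios)

def time_from_expected_speed_min(scenarios, length_km=ROUTE_LENGTH_KM):
    """Travel time one would wrongly infer from the expected speed alone."""
    return 60.0 * length_km / expected_speed_kmh(scenarios, length_km)

for name, scenarios in routes.items():
    print(name,
          expected_time_min(scenarios),
          expected_speed_kmh(scenarios),
          time_from_expected_speed_min(scenarios))
```

Both routes have an expected speed of 45 km/h, which would naively suggest 16 minutes for either route, while the expected travel times are 18 and 25 minutes.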

3 Estimating travel time from historical data 

The ultimate purpose of a traffic information service is to help users find an optimal
route between two points at a given time. In the urban environment conditions of alternative
routes are similar; hence, finding an optimal route typically reduces to finding the
fastest route. A good traffic information service would predict travel times as accurately
as possible for as many users as possible. Choosing the right optimization criterion in
this scenario is not trivial. In this section we formally introduce the problem of travel
time prediction, discuss alternative optimization criteria for travel time prediction, and
theoretically analyze expected prediction accuracies.

3.1 Travel time data distribution 

Firstly, let us consider some characteristics of travel time data. The distribution of travel 
times is positively skewed. The travel time over any road segment (of a positive length) 
is always larger than zero. Travel time approaches infinity when travel speed approaches 
zero, i.e. a vehicle drives very slowly. Most of the probability mass is expected to be 
concentrated at small positive values. With such a distribution the median of data is 
typically smaller than the mean. 

The log-normal distribution is a good example of such a distribution. If a random
variable $x$ is log-normally distributed ($x \sim \ln\mathcal{N}(\mu, \sigma)$), then $y = \log(x)$ has a normal
distribution. Figure 1 (a) presents example pdfs of the log-normal distribution, and
(b) presents empirical distributions of travel times in one road segment in St. Petersburg
observed in November-December 2012. We can see that the empirical traffic data
distribution resembles the log-normal distribution.

Figure 1: (a) Example log-normal probability density functions (pdf), $\mu = 0$. (b)
Empirical distributions of travel times (in seconds) on one road segment, each line represents the
same hour (e.g. 3AM, 6AM, ...).
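
To make the gap between the median and the mean concrete, the following minimal sketch evaluates the standard closed-form expressions for a log-normal variable, whose median is $e^{\mu}$ and whose mean is $e^{\mu + \sigma^2/2}$. The parameter values are arbitrary illustrations, not fitted to the travel time data.

```python
import math

def lognormal_median(mu, sigma):
    """Median of a log-normal variable: exp(mu), independent of sigma."""
    return math.exp(mu)

def lognormal_mean(mu, sigma):
    """Mean of a log-normal variable: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2.0)

# Illustrative parameters only (mu = 0, as in panel (a) of the figure).
for sigma in (0.25, 0.5, 1.0):
    print(sigma, lognormal_median(0.0, sigma), lognormal_mean(0.0, sigma))
```

The mean exceeds the median for every positive $\sigma$, and the gap widens as the distribution becomes more skewed.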


3.2 Problem setting 


Suppose we have a fixed road network divided into segments (links). For simplicity
assume that each segment represents one-way traffic from crossing to crossing. Let
$\mathcal{R} = \{r_1, r_2, \ldots, r_n\}$ be a set of road segments. For each segment $r$ we know its length
$l_r$, and the neighboring segments to which $r$ is connected.

In addition, for each segment we know travel times of vehicles that are using a mobile
application for navigation. Travel times are extracted from GPS traces. While dividing
a road network into segments and mapping GPS traces to the segments is a great
challenge, it is out of the scope of the current paper, which focuses on data analysis for
travel time prediction. An interested reader is referred to Lou et al. (2009), who discuss
some of these challenges. In our study we assume that the road segments and the travel
times are readily available.

In our setting vehicle information is anonymous; only travel times over a sequence
of segments, referred to as a trip, are available. It is not possible to know anything about
the vehicle, or whether several trips originate from the same vehicle.

Let $D = \{d_1, d_2, \ldots, d_m\}$ be a set of trips observed, and let $s_i^{(d)}$ be the index of
the $i$-th road segment in trip $d$. Then a trip $d$ can be described as a sequence of road
segments $\left(r_{s_1^{(d)}}, r_{s_2^{(d)}}, \ldots, r_{s_{k_d}^{(d)}}\right)$, where $k_d$ is the number of
segments in trip $d$. Let $t_i^{(d)}$ be the travel time on the $i$-th segment in trip $d$; then
the travel time for the whole trip $d$ is

$$T^{(d)} = \sum_{i=1}^{k_d} t_i^{(d)}. \qquad (1)$$

Our task is, given an intended trip as a sequence of segments, to predict the travel
time from historical data. At least two practical application scenarios related to this
task are possible. First, someone wants to know how long it will take to travel
from point A to point B. Second, a navigation system is selecting the fastest route from
point A to point B from several alternative routes.

Since no information about vehicles or drivers is accessible, we cannot make personalized
predictions. We can only make predictions for a particular route at a particular
time, assuming an average driver. As mentioned earlier, casting predictions for all possible
routes is impractical, and combinatorially infeasible on any realistically sized road
network. Thus, the best we can do is make predictions for individual road segments,
and then aggregate them to get a prediction for the whole trip. The prediction for trip
$d$ would be

$$\hat{T}^{(d)} = \sum_{i=1}^{k_d} \tau\left(r_{s_i^{(d)}}\right), \qquad (2)$$

where $\tau(r_j)$ is the predicted travel time for road segment $j$, $k_d$ is the number of segments
in trip $d$, and $s_i^{(d)}$ is the index of the
$i$-th road segment in trip $d$. We will refer to this approach as additive prediction.
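
As a minimal sketch of additive prediction in the sense of Eq. (2), a trip can be represented as a list of segment indices and the per-segment predictions as a dictionary. The function name, the data layout and the numbers below are illustrative assumptions, not part of the study.

```python
def additive_prediction(trip_segments, segment_prediction):
    """Sum the per-segment predictions along a trip, as in Eq. (2).

    trip_segments: sequence of segment indices visited by the trip.
    segment_prediction: mapping from segment index to predicted travel time.
    """
    return sum(segment_prediction[j] for j in trip_segments)

# Hypothetical per-segment predictions in seconds, keyed by segment index.
tau = {1: 30.0, 2: 45.0, 3: 20.0, 4: 60.0}
trip = [2, 3, 4]
print(additive_prediction(trip, tau))  # 125.0
```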

3.3 Optimization criteria 

Defining the optimization criteria and selecting an informative evaluation measure in
the additive prediction scenario is not trivial. From the route planning perspective, for
any new trip $d$ the predicted travel time $\hat{T}^{(d)}$ should be as accurate as possible. Many
alternative accuracy measures can be considered, as seen in Table 1.

Root Mean Squared Error (RMSE) is a popular loss function for predictive modeling
because of its convenient analytical properties. RMSE is the square root of the Mean
Squared Error (MSE), which is defined as

$$\mathrm{MSE} = \frac{1}{m} \sum_{d=1}^{m} \left(\hat{T}^{(d)} - T^{(d)}\right)^2, \qquad (3)$$

where $\hat{T}$ denotes the predicted value, $T$ denotes the true observed value, and $m$ is the
number of trips on which evaluation is made. The lower MSE/RMSE, the better.
MSE/RMSE punish large deviations of predictions from the true values.

Mean Absolute Error (MAE) is an alternative measure, defined as

$$\mathrm{MAE} = \frac{1}{m} \sum_{d=1}^{m} \left|\hat{T}^{(d)} - T^{(d)}\right|, \qquad (4)$$

the lower, the better.

While optimizing RMSE would minimize large deviations from the truth, minimizing
MAE would give a larger number of correct predictions to more drivers, and, thus, would
be more valuable for more customers.
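
The difference between the two criteria can be seen on a toy example. The sketch below implements the two error measures and compares two constant predictions on four made-up travel times with one slow outlier; the data are hypothetical and chosen only to make the contrast visible.

```python
import math

def mse(predicted, observed):
    """Mean squared error, Eq. (3)."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

def rmse(predicted, observed):
    """Root mean squared error."""
    return math.sqrt(mse(predicted, observed))

def mae(predicted, observed):
    """Mean absolute error, Eq. (4)."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

# Hypothetical observed travel times (seconds) with one very slow trip.
observed = [100.0, 110.0, 120.0, 400.0]
median_prediction = [115.0] * 4  # the sample median
mean_prediction = [182.5] * 4    # the sample mean

print("MAE :", mae(median_prediction, observed), mae(mean_prediction, observed))
print("RMSE:", rmse(median_prediction, observed), rmse(mean_prediction, observed))
```

On this toy sample the median-based prediction has the lower MAE (77.5 vs. 108.75), while the mean-based prediction has the lower RMSE, which illustrates why the choice of criterion determines which statistic should be predicted.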

More formally, suppose we have two alternative routes A and B with travel times $T_A$
and $T_B$ independently distributed with different distributions. We would like to choose
the one that is more likely to be the fastest. It can be shown that if $T_A \sim F(\mathcal{N}(\mu_A, \sigma_A))$
and $T_B \sim F(\mathcal{N}(\mu_B, \sigma_B))$, where $F$ is some monotone function, then we should choose
the route with the lowest median travel time (a proof can be found in the Appendix,
Proposition 3). For predicting the median we need to use the MAE criterion, as will be
demonstrated in the next subsection.

Another reason to focus on predicting the median (and hence using MAE) is a 
compromise between different target variables that it offers. We have already mentioned 
that the expected travel time and expected speed are not related by a strict dependence, 
and therefore it is impossible to build one model that predicts both well. On the other 
hand, the relation is straightforward for the median: the median speed is exactly the 
length of the route divided by the median travel time. Thus, if we manage to predict 
the median travel time well, we can automatically achieve good results in predicting the 
median speed. 

Hence, we recommend adopting MAE as the main optimization criterion, and focusing
on predicting the median travel time.

3.4 The best possible predictions

Next, let us consider what is the best possible MAE that can be achieved in urban travel
time prediction. Recall the restriction that no information about individual vehicles or
individual drivers can be used, so we need to output one prediction that fits everybody.

From Eq. (4), the lower bound for MAE is zero, which happens when $\hat{T}^{(d)} = T^{(d)}$
for all $d = 1, \ldots, m$, i.e. the predictions are equal to the observed times for all the test trips.
Suppose there is an oracle that knows the future. Could the oracle achieve MAE = 0?

We will analyze three different scenarios: 

1. road network consists of one segment, 

2. many segments, but all the vehicles follow the same route, 

3. many segments, and vehicles follow different routes. 

3.4.1 One segment 

For a start, suppose that the road network has only one segment. Then Eq. (4) becomes

$$\mathrm{MAE} = \frac{1}{m} \sum_{d=1}^{m} \left|\tau - t^{(d)}\right|,$$

where $\tau$ is the prediction for the segment, and $t^{(d)}$ is the travel time for trip $d$. Since $\tau$
is the same for all trips, $\tau$ may be equal to $t^{(d)}$ for all $d$, and, in turn, MAE may be equal to
zero, only in one case, which is when the travel times for all the trips are the same. In
such a case prediction is trivial. In any practical case, travel time of different vehicles
over one segment varies, and hence the minimum MAE cannot be zero.

It can be proven that with one road segment the prediction $\tau = \mathrm{median}(t^{(d)})$ minimizes
MAE, since the median minimizes the sum of absolute deviations (see e.g. Schwertman et al.
(1990) for a proof).
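
This property is easy to verify numerically. The following sketch searches a grid of constant predictions for the one with the lowest MAE on a small made-up sample and compares it with the sample median; the travel times are hypothetical.

```python
import statistics

def mae_constant(tau, times):
    """MAE of a single constant prediction tau over all observed times."""
    return sum(abs(tau - t) for t in times) / len(times)

# Hypothetical travel times (seconds) on one segment, positively skewed.
times = [38.0, 41.0, 45.0, 52.0, 60.0, 95.0, 180.0]

# Brute-force search over integer candidate predictions.
best_tau = min(range(30, 201), key=lambda tau: mae_constant(tau, times))

print(best_tau, statistics.median(times), statistics.mean(times))
```

The grid search returns the sample median (52), whereas the sample mean (73) gives a strictly larger MAE.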

3.4.2 Many segments, single route 

Now suppose the road network has $k$ segments, but each trip follows the same route.
As a result, the prediction is the same for all trips, $\hat{T} = \sum_{i=1}^{k} \tau(r_i)$. In this case Eq. (4)
becomes

$$\mathrm{MAE} = \frac{1}{m} \sum_{d=1}^{m} \left|\hat{T} - T^{(d)}\right|,$$

where $\hat{T}$ is the predicted travel time for the route, and $T^{(d)}$ is the observed travel time
for trip $d$.

Following the same argument as in the one segment case, the minimum MAE will be
achieved with the prediction $\hat{T} = \mathrm{median}(T^{(d)})$. Since all the trips follow the same
route, it is straightforward to compute the median of $T^{(d)}$, which is the median of the
observed travel times for the route.

This scenario applies well to monitoring highways where traffic detectors are installed
every few kilometers, and most of the vehicles continue along the whole highway.
However, this scenario is not very realistic in an urban environment, where each vehicle
may be following a different route. Hence, for the urban environment we need to consider a
more complex scenario.

Figure 2: Empirical distribution of trip lengths according to the number of road segments
(the total number of segments in the road network is 106).


3.5 Many segments, different routes 

If a road network has many segments, and each vehicle follows a different route, for a trip
$d$, the optimal prediction would be $\hat{T}^{(d)} = \mathrm{median}(T^{(d)})$, as discussed in the previous
section. The problem now is that we do not have enough observations of trips following
the same route, or have no observations of such trips at all. Thus, we cannot compute
$\mathrm{median}(T^{(d)})$ directly.

We could estimate the medians of each segment separately, and then add them up,
but unfortunately, the sum of the medians is not generally equal to the median of the
sum. Table 3 in the Appendix presents a proof by example. While the mean of the
sum is, in fact, equal to the sum of the means, this does not help much. As we have
discussed, travel data distribution is skewed, and in such a case the mean is not equal to
the median, so we cannot simply use the mean in place of the median either. Hence, our task
reduces to modeling the median of a sum of random variables in the context of travel
time estimation.
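
A small exact enumeration illustrates the point. The example below is our own hypothetical one (it is not the example from the Appendix): two independent segments, each with five equally likely travel times.

```python
from itertools import product
import statistics

# Hypothetical, equally likely travel times (seconds) on two independent segments.
segment_a = [10, 12, 14, 20, 60]
segment_b = [10, 12, 14, 20, 60]

# All 25 equally likely outcomes of the total travel time.
sums = [a + b for a, b in product(segment_a, segment_b)]

sum_of_medians = statistics.median(segment_a) + statistics.median(segment_b)
median_of_sum = statistics.median(sums)
sum_of_means = statistics.mean(segment_a) + statistics.mean(segment_b)
mean_of_sum = statistics.mean(sums)

print(sum_of_medians, median_of_sum, sum_of_means, mean_of_sum)
```

Here the sum of the medians is 28, the median of the sum is 32, and the mean of the sum (equal to the sum of the means) is 46.4, so the median of the sum lies strictly between the two simple aggregates.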

There have been many research attempts to model the median of the sum for various
data distributions under different assumptions (see e.g. Hall (1980)). We are not aware
of an analytical solution for a small number of variables, which is the case relevant
to travel time prediction. Typically trips consist of a small number of segments, as
can be seen in Figure 2, which presents an empirical distribution of trip lengths on a
road network of 106 segments in St. Petersburg, recorded in November-December 2012.
There is a peak at 44 segments, accounting for nearly 6% of the trips; this is due to the
main road in the network.

In summary, for an urban road network consisting of many segments, where vehicles
follow different routes, we cannot obtain optimal travel time estimates (median over
each route) directly. Hence, we need an approach for estimating the median of the sum
that would work well with a small number of segments. In the next section we present
our methodology for deriving a data-driven approximation for travel time estimation in
such a scenario.


4 Methodology for aggregated travel time prediction 

This section describes our methodology for travel time prediction in the urban settings
with many road segments and different routes. The idea is first to make estimates of
mean and median travel times for each road segment, and then to aggregate those estimates
into a prediction for each route of interest. The simplest approach for obtaining the estimates
is to take the mean and median over the last observed travel times. The remaining
challenge is how to combine those estimates. We propose the following solution.

4.1 Approach to solution

Suppose a trip consists of $k$ segments. The total travel time is a sum of travel times over
each segment, $T = \sum_{i=1}^{k} t_i$. We are looking for an estimate $\hat{T}$ which would minimize
MAE. The solution is based on the following observations.


1. When $k = 1$, MAE is minimized with the median over observed travel times,
$\hat{T} = \mathrm{median}(t_1)$ (Sec. 3.4.1).


2. When $k \to \infty$, assuming that travel times over each segment are independently and
identically distributed (IID), $T$ approaches the normal distribution (Central
Limit Theorem), thus $\mathrm{median}(T) \approx \mathrm{mean}(T)$, and $\mathrm{mean}\left(\sum_{i=1}^{k} t_i\right) = \sum_{i=1}^{k} \mathrm{mean}(t_i)$.
MAE is minimized with $\hat{T} = \sum_{i=1}^{k} \mathrm{mean}(t_i)$.


3. Observe that for a positively skewed random variable $t$ the expected sum of the
medians does not exceed the expected median of the sum,
$\sum_{i=1}^{k} \mathrm{median}(t_i) \le \mathrm{median}\left(\sum_{i=1}^{k} t_i\right)$.


4. For a positively skewed random variable $t$, such as travel time (Section 3.1), the median
of the sum is smaller than the mean of the sum Siegel (2001),
$\mathrm{median}\left(\sum_{i=1}^{k} t_i\right) < \mathrm{mean}\left(\sum_{i=1}^{k} t_i\right)$.


Distributions of travel time over different segments may depend on time of day, 
resulting in different distributions at different times. We consider that for a given time, 
travel time observations can reasonably be assumed to be independent from each other. 

The IID assumption does not exactly hold for the segments in our data, since the
segments have different lengths, but, as we will see in the experimental analysis, this
assumption gives a good approximation. Moreover, when a route is sufficiently long,
one can always partition it into segments of equal, and sufficiently large, length such that
travel times over them are identically and independently distributed.

In summary, the median is smaller than the mean, but the median approaches the
mean when the number of segments ($k$) becomes large. The solution for $k = 1$ is the
median, for $k \to \infty$ it is the mean; hence, the solution for small positive $k$ should lie
between the median and the mean.
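
The convergence of the median towards the mean can be illustrated with a small simulation. The sketch below assumes IID log-normal segment travel times with arbitrary parameters ($\mu = 0$, $\sigma = 1$); it is not based on the travel time data of this study, and the function name is ours.

```python
import random
import statistics

def relative_gap(k, sigma=1.0, n_samples=4000, seed=0):
    """Return (mean - median) / mean for the sum of k IID log-normal draws."""
    rng = random.Random(seed)
    totals = [
        sum(rng.lognormvariate(0.0, sigma) for _ in range(k))
        for _ in range(n_samples)
    ]
    mean_total = statistics.mean(totals)
    return (mean_total - statistics.median(totals)) / mean_total

# The gap between the mean and the median of the sum shrinks as k grows.
for k in (1, 5, 50):
    print(k, round(relative_gap(k), 3))
```

Under these assumptions the relative gap is large for a single segment and becomes small for routes of several dozen segments, in line with the observations above.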


4.2 Solution - a combination of mean and median 

Based on these observations, our proposed solution is to model the optimal travel time
estimate as a weighted average of the individual means and medians over segments:

$$\hat{T} = (1 - w_k) \sum_{i=1}^{k} \mathrm{median}(t_i) + w_k \sum_{i=1}^{k} \mathrm{mean}(t_i), \qquad (5)$$

where $w_k \in [0, 1]$ is a weight. If sample sizes are sufficiently large to accurately estimate
the median, this approach guarantees an optimal solution; a proof can be found in the
Appendix, Proposition 2. But as we will see from the experiments in Section 5, the
proposed method shows good performance even for small sample sizes.
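
A direct transcription of Eq. (5) might look as follows. This is a minimal sketch: the function name, the representation of the data as one list of observed travel times per segment, and the numbers are our own illustrative assumptions.

```python
import statistics

def combined_estimate(segment_samples, w_k):
    """Weighted combination of summed medians and summed means, Eq. (5).

    segment_samples: one list of observed travel times per segment of the trip.
    w_k: weight in [0, 1]; 0 gives the sum of medians, 1 the sum of means.
    """
    if not 0.0 <= w_k <= 1.0:
        raise ValueError("w_k must lie in [0, 1]")
    sum_medians = sum(statistics.median(s) for s in segment_samples)
    sum_means = sum(statistics.mean(s) for s in segment_samples)
    return (1.0 - w_k) * sum_medians + w_k * sum_means

# Hypothetical observations (seconds) on a two-segment trip.
samples = [[10, 12, 20], [30, 33, 60]]
print(combined_estimate(samples, 0.0))  # sum of medians: 45
print(combined_estimate(samples, 1.0))  # sum of means: 55
print(combined_estimate(samples, 0.3))  # in between, approximately 48
```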

When $k = 1$, $w_1 = 0$. When $k \to \infty$, $w_k \to 1$. For the rest of $k$ we model $w_k$ as
a function of $k$ as follows. We randomly generate routes over the actual road network,
sample travel times for these routes from the observed data, and compute the median travel
times and the optimal $w_k$ for each route. This way we generate a semi-synthetic dataset, on
which we can learn $w_k = f(k)$ using some machine learning method that can capture
a non-linear relation from data, for example, Artificial Neural Networks (ANN).

Figure 3: $w_k = f(k)$ learned with ANN. Solid lines denote estimated functions, circles
denote out-of-sample test data.


4.3 Generating data for learning the weight function 

We need to generate a dataset, where the input variable is $k$ and the target variable is
$w_k$. To generate one data point we:

1. select $k$ uniformly at random from a range $[1, k_{max}]$, where $k_{max}$ is the maximum
length of a route, which depends on the considered road network;

2. randomly generate a track: select one segment at random, select the next segment
uniformly at random from those segments connecting to the first one, continue
until $k$ segments are selected;

3. for each segment on the generated track randomly pick one observation of travel
time from the historical data, sum the selected travel times over $k$ segments, repeat
this $h$ times to obtain $h$ trips;

4. compute the true median over $h$ trips;

5. let $w_k$ run from 0 to 1 (grid search), and select the $w_k$ which gives the minimum
absolute deviation of the estimate computed as in Eq. (5) from the true median
computed in the previous step.

This procedure gives one data point. We generate $N$ such data points, and use them
for modeling $w_k$ as a function of $k$.
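
The five steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the road network is given as an adjacency dictionary, the historical data as a dictionary of observed travel times per segment, the per-segment mean and median in Eq. (5) are taken over that same historical data, and a walk that reaches a segment without successors is simply cut short. All names and the toy network are hypothetical.

```python
import random
import statistics

def generate_data_point(adjacency, history, k_max, h=1000, grid_steps=100, rng=None):
    """Return one (k, w_k) pair following steps 1-5 of the procedure."""
    rng = rng or random.Random()
    # Step 1: pick the route length uniformly from [1, k_max].
    k = rng.randint(1, k_max)
    # Step 2: random walk over connected segments.
    track = [rng.choice(sorted(adjacency))]
    while len(track) < k and adjacency[track[-1]]:
        track.append(rng.choice(adjacency[track[-1]]))
    k = len(track)  # assumption: accept a shorter track at a dead end
    # Step 3: build h synthetic trips by sampling one observation per segment.
    trips = [sum(rng.choice(history[s]) for s in track) for _ in range(h)]
    # Step 4: the "true" median of the total travel time.
    true_median = statistics.median(trips)
    # Step 5: grid search for the weight whose Eq. (5) estimate is closest.
    sum_medians = sum(statistics.median(history[s]) for s in track)
    sum_means = sum(statistics.mean(history[s]) for s in track)
    grid = [i / grid_steps for i in range(grid_steps + 1)]
    w_k = min(grid, key=lambda w: abs((1 - w) * sum_medians + w * sum_means - true_median))
    return k, w_k

# Toy network of three one-way segments forming a loop (hypothetical data).
adjacency = {"a": ["b"], "b": ["c"], "c": ["a"]}
history = {s: [10, 12, 14, 20, 60] for s in adjacency}

rng = random.Random(42)
data = [generate_data_point(adjacency, history, k_max=3, h=500, rng=rng) for _ in range(20)]
print(data[:5])
```

Repeating the call $N$ times yields the dataset of $(k, w_k)$ pairs; the grid resolution `grid_steps` and the number of synthetic trips `h` trade accuracy against computation time.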


4.4 Learning the weight function 

Once we have generated a dataset for relating $k$ and $w_k$, we can proceed in two ways:
we can construct a look-up table, where for each $k$ we list a corresponding $w_k$,
or we can find a functional form $w_k = f(k)$ using a machine learning method, for
example, ANN. As an example illustrating that a nice functional form of $w_k$ exists,
Figure 3 plots $w_k = f(k)$ learned on synthetic data sampled from the log-normal distribution
$\ln\mathcal{N}(0, s)$. We can see that different data distributions give different functions, but
the learned functions give accurate approximations, as tested on out-of-sample data.
While the weight function is different for different distributions (with different standard
deviations), we can consider that for a given road network of interest the distribution is
fixed, and therefore, one estimated weight function can be learned and applied to that
road network for a given time period.
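
The look-up table variant can be sketched in a few lines. Averaging the generated weights for each $k$ and falling back to the nearest tabulated $k$ are our own assumptions for illustration; the text does not prescribe how the table entries are aggregated.

```python
from collections import defaultdict
import statistics

def build_lookup(data_points):
    """Average the generated weights for each k (aggregation rule is an assumption)."""
    by_k = defaultdict(list)
    for k, w in data_points:
        by_k[k].append(w)
    return {k: statistics.mean(ws) for k, ws in sorted(by_k.items())}

def lookup_weight(table, k):
    """Return w_k, using the nearest tabulated k when k itself is missing."""
    if k in table:
        return table[k]
    nearest = min(table, key=lambda known: abs(known - k))
    return table[nearest]

# Hypothetical (k, w_k) pairs, e.g. produced by the data generation procedure.
points = [(1, 0.0), (1, 0.0), (2, 0.2), (2, 0.4), (4, 0.6)]
table = build_lookup(points)
print(table)
print(lookup_weight(table, 3), lookup_weight(table, 10))
```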


4.5 Travel time prediction scenario 

After learning the function 

Wk = f{k ), we recommend the following procedure for travel time prediction. 
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1. Estimate the mean and the median travel times for each segment on the road 
network. 

2. For each trip of interest compute an aggregated prediction using Wk, as specified 
in Eq. ©. 

A separate $w_k$ can be learned and used, for instance, for different times of the day.
If more complex machine learning approaches are used for travel time prediction,
including, e.g., time of the day, holidays, or weather information as input features,
we suggest building two models per segment: one for predicting the mean, and one for
predicting the median. Then, for a trip of interest, combine the predictions using
Eq. (5) with a fixed function $w_k = f(k)$.
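The two-step procedure can be sketched as a single function. The names `predict_route_time`, `seg_mean`, `seg_median`, and `weight_fn` are hypothetical and chosen only for this illustration; the combination itself is the weighted sum of the sum of medians and the sum of means, with weight $w_k$ on the means.

```python
def predict_route_time(route, seg_mean, seg_median, weight_fn):
    """Combine per-segment means and medians into one route-level prediction.

    route: sequence of segment ids along the trip of interest.
    seg_mean, seg_median: dicts mapping segment id to the current estimate.
    weight_fn: callable mapping the number of segments k to w_k in [0, 1].
    """
    k = len(route)
    w = weight_fn(k)
    sum_of_means = sum(seg_mean[s] for s in route)
    sum_of_medians = sum(seg_median[s] for s in route)
    return (1 - w) * sum_of_medians + w * sum_of_means


# Toy numbers (seconds), invented for the example.
seg_mean = {"a": 60.0, "b": 90.0}
seg_median = {"a": 50.0, "b": 70.0}
prediction = predict_route_time(["a", "b"], seg_mean, seg_median, lambda k: 0.25)
print(prediction)  # 0.75 * 120 + 0.25 * 150 = 127.5
```

If per-segment models are used instead of simple estimates, the dictionaries would hold the outputs of the mean model and the median model for the relevant time slot.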


5 Experimental analysis 

We analyze the performance of the proposed approach with a case study from
St. Petersburg. The goal is to compare the performance of the new approach to the
performance of two baselines: sum of medians and sum of means.

5.1 Dataset 

We use one year (2012) of GPS data, collected via a mobile application Yandex.Navigator. 
The dataset covers an interconnected network of streets in St. Petersburg. The network 
consists of 106 road segments. 1 132 277 trips are recorded in the dataset, covering over 
14M records. The dataset is extensive in both the number of records and the time covered,
and we believe it is as representative of the traffic population in St. Petersburg as it
could be. Figure 0 presents an empirical distribution of trips in terms of number of segments.

5.2 Experimental protocol 

To minimize the possibility of overfitting, only January-February data is used as
training data for generating route data in order to learn the weight function, and
March-December data is used for testing the predictive performance.

We operate in discrete time steps $\Delta$, over which we collect and summarize input
data, and for which we make predictions. For example, if $\Delta = 10$ min and the
current time is 14:00, we would estimate the mean and the median travel time on each
segment from the time interval 13:50-14:00, and use them for making predictions for the
time interval 14:00-14:10.

A trip is included in the mean and median estimation after it has finished. That is, if,
for instance, a trip starts at 13:48 and ends at 13:55, it is included in the estimation
interval 13:50-14:00. Predictions are based on the data accumulated before a test trip
starts. That is, if a test trip starts at 14:07 and ends at 14:15, data from the interval
13:50-14:00 is used for making predictions for the whole test trip.
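The interval bookkeeping described above can be sketched as follows. The function names are hypothetical, and treating intervals as half-open, aligned to midnight, is an assumption of this sketch (the text does not say which interval a trip ending exactly on a boundary belongs to).

```python
from datetime import datetime, timedelta


def interval_containing(t, delta):
    """Return (start, end) of the delta-aligned interval [start, end) containing t."""
    midnight = t.replace(hour=0, minute=0, second=0, microsecond=0)
    index = (t - midnight) // delta  # timedelta // timedelta gives an int
    start = midnight + index * delta
    return start, start + delta


def estimation_interval_for_finished_trip(end_time, delta):
    """A finished trip contributes to the interval in which it ended."""
    return interval_containing(end_time, delta)


def input_interval_for_test_trip(start_time, delta):
    """A test trip is predicted from the last complete interval before it starts."""
    current_start, _ = interval_containing(start_time, delta)
    return current_start - delta, current_start


delta = timedelta(minutes=10)
day = datetime(2012, 3, 1)
# Trip ending at 13:55 is counted in 13:50-14:00.
print(estimation_interval_for_finished_trip(day.replace(hour=13, minute=55), delta))
# Test trip starting at 14:07 is predicted from 13:50-14:00.
print(input_interval_for_test_trip(day.replace(hour=14, minute=7), delta))
```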

We compare the performance of the proposed combined approach (COM) with two
baselines: the sum of means (SMN) and the sum of medians (SMD). Where technically
possible, we also show the performance of the true median prediction (MED), which
gives the theoretically optimal prediction.

The true median (MED) is the theoretically optimal solution, which is only possible 
if we have many observed trips following the same route. It is possible to estimate MED 
on particular routes, when many vehicles follow the same route, but it is not feasible to 
estimate MED for most of the routes on the road network. The next best we can do is to 
approximate MED by combining the medians and the means over individual segments
(COM), which is our proposed solution, and is expected to give an accuracy close to
that of MED. Finally, SMN and SMD are two baseline approaches. SMD is expected to 
perform well on routes consisting of a small number of segments, and SMN is expected 
to perform well on the routes consisting of a large number of segments. 

For COM we estimate the optimal combination weight $w_k$ by letting $w_k$ run from 0
to 1 and selecting the value that gives the minimum estimation error, as described in
Sec. 4.4.

We standardize the prediction errors by the trip lengths, so that the figures are
comparable across different samples, as

$$\mathrm{MAE}^* = \sum_{d=1}^{m} |\hat{T}^{(d)} - T^{(d)}| / L^{(d)}, \qquad (6)$$

where $L^{(d)}$ is the length of trip $d$ (in km), i.e., the sum of the lengths of its
segments. Note that the standardized MAE* relates to the unstandardized MAE as
$\mathrm{MAE}^* = m\,\mathrm{MAE} / \sum_{d=1}^{m} L^{(d)}$.

For easier visual interpretation and comparison across different aggregation times, we
plot MAE relative to the performance of the sum of means baseline (SMN), which
is currently used in practice as the state-of-the-art approach. Relative MAE is the
MAE of the approach of interest divided by the MAE of the baseline. A relative MAE
smaller than one means that the approach of interest performs better than
the baseline. Since relative MAE incorporates the baseline, the results also show
how good the methods are from a global perspective.
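As a small illustration of the relative measure, the sketch below divides the MAE of a method by the MAE of the baseline on the same test trips. The function names and the toy numbers are invented for the example, and the plain (unstandardized) MAE is used here as an assumption.

```python
def mean_absolute_error(predicted, actual):
    """Plain mean absolute error over paired predictions and observations."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_mae(predicted, baseline_predicted, actual):
    """MAE of the approach of interest divided by the MAE of the baseline.

    Values below 1 mean the approach beats the baseline on these trips.
    """
    return (mean_absolute_error(predicted, actual)
            / mean_absolute_error(baseline_predicted, actual))


actual = [10.0, 20.0]    # observed travel times (toy values)
com = [11.0, 19.0]       # predictions of the approach of interest
smn = [12.0, 22.0]       # predictions of the baseline
print(relative_mae(com, smn, actual))  # 1.0 / 2.0 = 0.5
```

The ratio is undefined when the baseline has zero error; the sketch does not guard against that case.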

5.3 Results: many segments, one route 

The goal of this experiment is to verify whether the proposed combined approach (COM)
provides a better estimate of travel times than the baselines (SMN and SMD).
In this experiment we use only a subset of the data, consisting of trips over the main
road (nearly 6% of the trips in the dataset), and refer to this route as Route44. Since
this route is popular, we have a number of traces following this particular route, and
thus we are able to compute the theoretically optimal prediction, which is the median
travel time over the whole route (MED). Thus, we can also investigate how close the
proposed approach (COM) comes to the theoretically optimal prediction (MED).

While the main route consists of 44 segments (a fixed k = 44), we can also estimate
$w_k$ for any k < 44 by considering a shorter sub-section of the main road. For instance,
when k = 1 we would consider only the first segment of the main route, and when k = 2
we would consider the first two segments. For each k we select one sub-route, which
starts at segment #1 and ends at segment #k.

5.3.1 Learning the weight function 

Figure 4 presents the estimation error as a function of different possible weights $w_k$.
We can clearly see that there is a global minimum in each case, suggesting that there is
a weight for combining the sum of means and the sum of medians in an optimal way.
Since the minimum of the solid line overlaps with the dashed line, we can conclude that
the combination approach can reach the minimum error that would be achieved
by knowing the true median.

The resulting weights $w_k$ for Route44, estimated on the training data (Jan-Feb 2012),
are presented in Figure 5. The learned weights are not monotonic in k,
perhaps due to the varying length of segments across the route. Next, we will use these
$w_k$ for predicting travel time on unseen data.

Figure 4: Estimation error as a function of different weights $w_k$ (solid line), and true
median error (dashed).

Figure 5: Learned weights $w_k$ for trips of length k on Route44.

Figure 6: Relative prediction errors on Route44 on test data Mar-Dec 2012.

Figure 7: Learned weights $w_k$ on all routes, estimated on Jan-Feb 2012 data.


5.3.2 Prediction accuracy 

Figure 6 plots the prediction errors of four alternative approaches. The discretization
in this experiment is $\Delta = 120$ min, thus the prediction horizon is 0-120 min. The
discretization step is chosen in this experiment such that several samples are available
in each time slot for each segment.

We can see that the proposed approach COM clearly performs better than the
baselines SMN and SMD. At small k, COM performs slightly better than SMD, and SMN
performs much worse; at larger k, COM performs slightly better than SMN, and
SMD performs much worse, as expected. Hence, COM combines the advantages of the
two baselines. Moreover, COM performs as well as the theoretically optimal MED,
which is good news, since MED is rarely feasible to compute in practice, and COM
proves to be a good approximation. Interestingly, COM sometimes performs slightly
better than MED. That can be explained by the small sample sizes from which MED is
estimated, in which case the sample median is not as precise as the weighted average
obtained by COM using the learned weights $w_k$.

5.4 Results: many segments, different routes 

Next, we analyze how the proposed approach performs on a road network, consisting of 
many segments and different routes. 

5.4.1 Learning the weight function 

Figure 7 presents the combination weights $w_k$, learned using an artificial neural
network with four hidden layers. The weights increase monotonically with k, as expected,
since the median of the sum approaches the mean of the sum, as discussed in Sec. 4. We
see low weights at small k, which means that when a trip consists of a small number of
segments, the sum of medians (SMD) dominates. When the number of segments per trip
reaches about 10, the sum of means (SMN) starts to dominate.

Figure 8: Testing errors on the whole route network month-by-month.

5.4.2 Predictive performance 

Figure 8 plots the predictive performance month-by-month for different discretization
steps $\Delta$. The prediction horizon corresponds to the discretization step $\Delta$,
i.e., if $\Delta = 10$ min, the prediction horizon is 0-10 min. We see that the proposed
combination approach COM consistently outperforms both baselines at different
discretization steps (and prediction horizons) over the course of the year.


6 Conclusion 

In this study we analyzed optimization criteria for travel time prediction in urban
environments, where vehicles follow many different routes. We proposed a methodology for
aggregated travel time prediction, which combines the mean and median
estimates of travel times over individual trip segments into a single prediction for the
whole trip. Experimental results demonstrated that the proposed approach consistently
outperforms the current baselines.

Based on the results, we recommend using the proposed combination of the sum 
of means and sum of medians of individual road segments for constructing aggregated 
predictions. 

This study opens several interesting directions for further research. We have focused
on optimizing the mean absolute error of predictions. We have argued that while this
quantitative optimization criterion may occasionally permit large errors, it favors
predicting accurately for as many users as possible. From the practical perspective,
it would be interesting to consider, and possibly integrate, multiple optimization
criteria. For instance, a designer may want the system to focus on accurately predicting
travel time over longer routes, while errors on short routes matter less.
Another interesting extension would be to consider asymmetric costs of errors, where,
for instance, overprediction of travel time is tolerated better than underprediction.

A Proofs

Proposition 1. Sum of the medians is not necessarily equal to the median of the sum. 

Proof. Proof by example. Example data is presented in Table 3. From the table,
median(sum(r)) = 24, but sum(median(r)) = 5 + 7 + 8 = 20. □
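The same effect can be reproduced with a few lines of Python. The travel times below are hypothetical and are not the data of Table 3; they are chosen only so that the two quantities differ visibly.

```python
import statistics

# Hypothetical travel times (minutes): rows are trips, columns are segments.
trips = [
    (1, 2, 9),
    (2, 9, 1),
    (9, 1, 2),
]

# Median over trips of the whole-route travel time.
median_of_sum = statistics.median(sum(trip) for trip in trips)
# Sum over segments of the per-segment median.
sum_of_medians = sum(statistics.median(segment) for segment in zip(*trips))

print(median_of_sum, sum_of_medians)  # 12 6
```

Every trip takes 12 minutes in total, yet each segment has a median of 2 minutes, so summing medians underestimates the median route time by half in this toy case.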

Proposition 2. Given the weight parameter $w_k \in [0, 1]$, the following model exactly
estimates the median of the sum for positively skewed, independent and identically
distributed $t_i$:

$$\mathrm{median}\Big(\sum_{i=1}^{k} t_i\Big) = (1 - w_k) \sum_{i=1}^{k} \mathrm{median}(t_i) + w_k \sum_{i=1}^{k} \mathrm{mean}(t_i),$$

where $t_i$ is the travel time over segment $i$, $k$ is the number of segments, and
$w_k \in [0, 1]$ is the weight parameter.

Proof. The expression can be rearranged into

$$w_k = \frac{\mathrm{median}\big(\sum_{i=1}^{k} t_i\big) - \sum_{i=1}^{k} \mathrm{median}(t_i)}{\sum_{i=1}^{k} \mathrm{mean}(t_i) - \sum_{i=1}^{k} \mathrm{median}(t_i)}.$$

From the definition of the mean, the following identity holds:
$\sum_{i=1}^{k} \mathrm{mean}(t_i) = \mathrm{mean}\big(\sum_{i=1}^{k} t_i\big)$.
From Sec. 4, $\mathrm{median}\big(\sum_{i=1}^{k} t_i\big) < \sum_{i=1}^{k} \mathrm{mean}(t_i)$, and
$\sum_{i=1}^{k} \mathrm{median}(t_i) < \mathrm{median}\big(\sum_{i=1}^{k} t_i\big)$.
After plugging these inequalities into the expression above we get $0 < w_k < 1$, which
means that given the right $w_k$ we can approximate the median of the sum as a
combination of the sum of the medians and the sum of the means. □

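Proposition 3. Let $F$ be a monotone function, and let
$T_A \sim F(\mathcal{N}(\mu_A, \sigma_A^2))$ and $T_B \sim F(\mathcal{N}(\mu_B, \sigma_B^2))$
be two independent random variables.
Then $p(T_A > T_B) > \frac{1}{2} \Leftrightarrow \mathrm{median}(T_A) > \mathrm{median}(T_B)$.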

Proof. First, let us prove the proposition in a special case, where $F$ is the identity
function, i.e., $T_A$ and $T_B$ are normally distributed.

In this case $\mathrm{median}(T_A) = \mu_A$ and $\mathrm{median}(T_B) = \mu_B$. Moreover,
$p(T_A > T_B) = p(T_A - T_B > 0)$. Denote $T = T_A - T_B$. Since $T_A$ and $T_B$ are
independent, $T$ is also normally distributed, with mean $\mu = \mu_A - \mu_B$ and standard
deviation $\sigma = \sqrt{\sigma_A^2 + \sigma_B^2}$, so that
$p(T > 0) = \frac{1}{2}\big[1 + \mathrm{erf}\big(\frac{\mu}{\sigma\sqrt{2}}\big)\big]$.
Thus, $p(T > 0) > \frac{1}{2}$ is equivalent to $\mathrm{erf}\big(\frac{\mu}{\sigma\sqrt{2}}\big) > 0$,
which in turn is equivalent to $\mu > 0$, because the error function is odd and strictly
increasing. Since $\mu = \mu_A - \mu_B$, we have
$p(T_A > T_B) > \frac{1}{2} \Leftrightarrow \mu_A > \mu_B$. That proves the proposition in
the special case.

Now let us prove the general case. For any monotone function $F$ there exists the
inverse $F^{-1}$, which is also monotone. Let us denote $t_A = F^{-1}(T_A)$ and
$t_B = F^{-1}(T_B)$. Since $F^{-1}$ is monotone, $T_A > T_B \Leftrightarrow t_A > t_B$.
From this fact two conclusions can be drawn.

First, $p(T_A > T_B) = p(t_A > t_B)$.

Second, $p(t_A > F^{-1}(\mathrm{median}(T_A))) = p(T_A > \mathrm{median}(T_A)) = \frac{1}{2}$;
in other words, $\mathrm{median}(t_A) = F^{-1}(\mathrm{median}(T_A))$. The same holds for
$t_B$ and $T_B$.

Now, if $\mathrm{median}(T_A) > \mathrm{median}(T_B)$, then
$\mathrm{median}(t_A) > \mathrm{median}(t_B)$. The variables $t_A$ and $t_B$ are normally
distributed, so, as proved above, $p(t_A > t_B) > \frac{1}{2}$. But since
$p(T_A > T_B) = p(t_A > t_B)$, it follows that $p(T_A > T_B) > \frac{1}{2}$.

Conversely, if $p(T_A > T_B) > \frac{1}{2}$, then $p(t_A > t_B) > \frac{1}{2}$, hence
$\mathrm{median}(t_A) > \mathrm{median}(t_B)$ and $\mathrm{median}(T_A) > \mathrm{median}(T_B)$. □
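Proposition 3 can be checked numerically for a concrete choice of $F$. The sketch below takes $F = \exp$ (so $T_A$ and $T_B$ are log-normal), which is an assumption made for illustration; the function names and parameter values are likewise invented. It uses the closed-form probability from the special case of the proof rather than simulation.

```python
import math


def prob_a_exceeds_b(mu_a, sigma_a, mu_b, sigma_b):
    """p(T_A > T_B) for T = F(N(mu, sigma^2)) with an increasing F such as exp.

    Because F is increasing, this equals p(t_A > t_B) for the underlying
    independent normal variables t_A and t_B.
    """
    mu = mu_a - mu_b
    sigma = math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))


def lognormal_median(mu):
    """Median of exp(N(mu, sigma^2)); it does not depend on sigma."""
    return math.exp(mu)


# A has the larger median but a much smaller spread than B.
p = prob_a_exceeds_b(1.0, 0.5, 0.8, 2.0)
print(p > 0.5, lognormal_median(1.0) > lognormal_median(0.8))  # True True
```

Note that the log-normal mean, $\exp(\mu + \sigma^2/2)$, is larger for B than for A in this example, so comparing means would rank the two variables the other way round.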